Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

Week 1

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 12 20:24:50 2022"

I’m taking this course very close to what is (hopefully) the end of my doctoral studies, honestly mainly to collect some of the remaining credits that I need.

That said, the course does contain some interesting and potentially useful topics. I have familiarity with R, RStudio, R Notebooks, GitHub and so forth, but have just learned things as I’ve needed them and have never put any effort into actually studying things. I suppose I can learn a thing or two about how to properly use these tools; and similarly for all the analysis techniques that are included in the course syllabus.

I have a bit of a love-hate relationship with R; I used to be a Python coder, but at some point realized that I’ve ended up working mostly with R. I’m familiar enough with it that the first few chapters of the RHDS book seemed to mainly contain stuff that I already know.

I find it curious, actually, that the book teaches the tidyverse package from the very beginning. The fact that there’s an entire ‘sub-language’ within R, in the form of tidyverse, is one of the things that I really don’t enjoy that much. I like tidyverse, but things get confusing when tidyverse and base-R are mixed up (as they inevitably are, at times). But I guess this shows that tidyverse is the way to go in the future, and this is good to know.

Here’s the link to my GitHub repo.


Week 2


date()
## [1] "Mon Dec 12 20:24:50 2022"
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Okay, let’s start with reading the data:

The data contains information about the participants of a statistics course, their course points, and their learning strategies. We’ve previously filtered the data so that what’s remaining is points, age, gender, an index of attitudes towards statistics, and indices of deep, strategic, and surface learning strategies.

Some summaries. For numerical variables, let’s take the mean:

data %>%
  summarise(across(c(age, attitude, deep, stra, surf, points), ~ mean(.x)))
##        age attitude     deep     stra     surf   points
## 1 25.51205 3.142771 3.679719 3.121235 2.787149 22.71687

Strategy variables have been rescaled to range from 1 to 5, so the means hover around the midpoint of the scale: deep is clearly above it, stra slightly above, and surf somewhat below. Points range from 0 to 30.

For gender, let’s look at proportions:

prop.table(table(data$gender))
## 
##         F         M 
## 0.6626506 0.3373494

Somewhat more females than males, it seems.

Let’s look at scatterplots:

ggpairs(data, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

With regards to what’s intended to be the outcome variable (points), we can observe that there does not seem to be a gender difference, although at the high end of the scale there seems to be a bit of a surplus of males. The typical participant is in their early twenties, with the age distribution naturally skewed towards older students. The scatterplot of attitude vs. points suggests a positive relationship, which is also evident in the corresponding correlation coefficient; the same applies to the negative relationship between deep and points. We can also see signs that surf has a negative relationship with attitude, deep, and stra.

Given that the data contains indices for three different learning strategies (surface, deep, and strategic), it’s natural to use these to predict the outcome variable. From a theoretical point of view, that is; judging purely by the correlations with the outcome variable, attitude would be the strongest candidate. But we can only pick three, so let’s fit a linear model and look at its summary.

fit <- lm(points ~ deep + stra + surf, data = data)

summary(fit)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

Alright, now let’s interpret this.

First, we have the estimates column. We’re assuming that in the population from which this sample has been drawn, each student’s score is given by

\(points = \beta_0 + \beta_1 \cdot deep + \beta_2 \cdot stra + \beta_3 \cdot surf + \epsilon\)

where \(\epsilon\) is the error term and represents effects not taken into account in the model.

With the model, we seek to estimate \(\beta_0 ... \beta_3\) based on the data we have; the Estimate column gives these estimates. Now, we can imagine drawing an arbitrarily large set of samples like this one and performing the same modeling exercise on each. We’d get different estimates each time, and the standard error (Std. Error) tells us how much, on average, the estimates differ from the actual parameter values. Strictly speaking, what’s shown in the Std. Error column is itself an estimate of the true standard error, given that it’s computed from the one data set we have.

These we could use to construct confidence intervals: roughly speaking, for each estimate we can compute \(\hat{\beta_i} \pm 2 \cdot SE(\hat{\beta_i})\) based on a sample to get a 95% confidence interval. 95% of such CIs, computed for the set of arbitrarily many samples, contain the true population parameter.
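As a sketch (using the `fit` object from above), these intervals can be computed by hand and compared with R’s exact `confint()`, which uses the t distribution instead of the rough factor of 2:

```r
# Rough 95% CIs: estimate ± 2 * SE, versus R's exact intervals
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)
confint(fit)
```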

The test statistic is related. Let’s put forward the null hypothesis that the true population value of a parameter is exactly 0. We can compute the \(t\)-statistic from the estimate and its standard error, and then, under the assumption that the true parameter is 0, the probability of observing a value of \(t\) at least as extreme (as far away from 0) as the one observed. This probability, the \(p\)-value, is given in the rightmost column. So one can roughly interpret, say, the \(p\)-value of 0.0769 for the surf coefficient as saying that, if the true value were 0, there would be a roughly 8% chance of seeing a test statistic this extreme or more so. Given that the test statistic is calculated from the parameter estimate, one can simplify this by saying that the \(p\)-value tells us how likely it is to see a parameter estimate at least as extreme as the one observed.
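To make this concrete, the t statistics and p-values in the summary can be reproduced from the estimates and standard errors (a sketch, assuming the `fit` object from above):

```r
# t = estimate / SE; p = two-tailed tail probability from the t distribution
# with the model's residual degrees of freedom
coefs  <- coef(summary(fit))
t_vals <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_vals <- 2 * pt(abs(t_vals), df = fit$df.residual, lower.tail = FALSE)
round(cbind(t_vals, p_vals), 4)
```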

Note that all of this holds if the assumptions of the linear regression model are met; if not, various errors may creep in.

So, the model tells us that for each one-unit increase in deep, points decrease by 0.7; for stra, they increase by 1.0; and for surf, they decrease by 1.6. For deep, we’d observe a result this extreme, or more so, with a roughly 40% probability even if deep had no relationship with points at all, so one shouldn’t conclude too much based on it. For stra and surf, our inferences are on somewhat firmer ground.

The assignment says that explanatory variables that do not have a statistically significant relationship with the target variable should next be removed. ‘Statistical significance’ has no unambiguous meaning, but thresholds such as \(p < 0.001\), \(p < 0.01\) or \(p < 0.05\) are commonly used. None of the explanatory variables are statistically significant at these levels, but R helpfully marks, with a dot, the coefficients that are distinguishable from zero at the \(p < 0.1\) level.

So let’s roll with that and drop deep:

fit <- lm(points ~ stra + surf, data = data)

summary(fit)
## 
## Call:
## lm(formula = points ~ stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4574  -3.2820   0.4296   4.0737   9.8147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.5635     3.3104   7.118 3.31e-11 ***
## stra          0.9635     0.5950   1.619    0.107    
## surf         -1.3828     0.8684  -1.592    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.822 on 163 degrees of freedom
## Multiple R-squared:  0.03634,    Adjusted R-squared:  0.02452 
## F-statistic: 3.074 on 2 and 163 DF,  p-value: 0.04895

Removing this variable means that the remaining variables lose even their marginal significance. So whereas deep did not have a clear relationship with the outcome variable in the first model, it seems that controlling for it made the relationships between the other two variables and the outcome more visible. Given that there is a theoretical reason for including it, I’d opt for keeping deep in the model instead.

Finally, we have the \(R^2\), given as “Multiple R-squared” and “Adjusted R-squared” in the model summaries. The R-squared is

\(R^2 = 1 - RSS/TSS\)

where \(TSS = \sum{(y_i - \bar{y})^2}\) is, in our case, the sum of squared differences of each student’s score from the mean score, and \(RSS = \sum{(y_i - \hat{y_i})^2}\) looks at the difference between each participant’s actual score and the score predicted by our model. So we first measure how much variation there is in the data to begin with (TSS), then how much variation is left over after predicting the target variable with our model (RSS). The difference \(TSS - RSS\) measures how much variation the model accounts for, and dividing by TSS gives \(R^2\), which thus tells us ‘how much of the variation in the target variable is explained by the explanatory variables’. \(R^2\) never decreases as more explanatory variables are added, which is why the adjusted \(R^2\), which takes the complexity of the model into account, can be useful. In this case, \(R^2\) is in any case unimpressive at a few percent, meaning that the model doesn’t tell us very much.
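These quantities are easy to verify by hand (a sketch, assuming the `fit` and `data` objects from above; the result should match the latter model’s “Multiple R-squared”):

```r
# R^2 = 1 - RSS/TSS, computed directly from the residuals
rss <- sum(residuals(fit)^2)
tss <- sum((data$points - mean(data$points))^2)
1 - rss / tss
```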

Finally, using our latter model, let’s assess the fit. Ordinary least squares linear regression makes a number of assumptions. We assume that the target variable relates linearly to the explanatory variables; that is, we don’t need something like a quadratic transformation of the data. We assume that errors are independent of each other, i.e. knowing how much one participant’s predicted score differs from their actual score tells us nothing about any other participant. And we assume that errors follow a normal distribution whose variance does not depend on the level of the target variable.

Let’s plot some diagnostics:

par(mfrow = c(2,2))
plot(fit, which = c(1,2,5))

The assignment asks for residuals versus fitted values, a normal Q-Q plot, and a plot of residuals against leverage, so we’ll skip the fourth default plot (the scale-location plot of fitted versus standardized residuals). First, we should see no patterns in the plot of fitted values versus residuals. This is not entirely the case, as small and large fitted values appear to be associated with smaller residuals. Note that the model only predicts values between 20 and (roughly) 26, while points are capped at 30; perhaps partly because of this, the largest negative residuals are around -15, while the largest positive ones are around 10. This asymmetry is also visible in the Q-Q plot, which compares the distribution of the residuals against the normal distribution: the distribution is not quite normal. Nevertheless, I don’t think these diagnostics reveal any serious violations of the assumptions. Finally, the residuals vs. leverage plot should show whether any data points have an outsized impact on the resulting model, i.e. whether removing them would substantially change it. A common rule of thumb is to flag observations with a Cook’s distance above 1; none of the data points come close to having that much influence.

I think it’s worth noting, though, that strictly speaking ordinary least squares regression assumes the outcome variable to be unbounded, which it is not in this case. Generally speaking, this is not a highly useful model in any case.


Week 3

date()
## [1] "Mon Dec 12 20:25:01 2022"

Our data set contains information on the performance of the students of two courses, along with background variables as well as their alcohol consumption.

colnames(data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures.x"
## [16] "schoolsup"  "famsup"     "paid.x"     "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences.x"
## [31] "G1.x"       "G2.x"       "G3.x"       "failures.y" "paid.y"    
## [36] "absences.y" "G1.y"       "G2.y"       "G3.y"       "failures"  
## [41] "paid"       "absences"   "G1"         "G2"         "G3"        
## [46] "alc_use"    "high_use"

Variables ending in .x relate to a mathematics course, variables ending in .y to a Portuguese course. Similarly named variables without a suffix contain their mean or, in the case of non-numeric variables, come from the former data set.

‘Dalc’ and ‘Walc’ denote workday and weekend alcohol consumption, respectively, measured on an ordinal scale from 1 (very low) to 5 (very high). ‘alc_use’ is their mean, and ‘high_use’ is a binary variable equalling TRUE if ‘alc_use’ is 3 or higher.
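As a sketch of how these two variables were derived in the data wrangling step (the actual wrangling script was run beforehand; the condition follows the description above):

```r
library(dplyr)

# alc_use is the mean of workday and weekend consumption;
# high_use flags values of 3 or above
data <- data %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use >= 3)
```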

We want to build a logistic regression model with four explanatory variables, so we should pick some that could predict the level of a student’s alcohol use. I’ll put forward the following hypotheses:

- Better family relationships (famrel) are associated with a lower probability of high alcohol use.
- Being in a romantic relationship (romantic) is associated with a lower probability of high use.
- Males (sex) are more likely to belong in the high-use group than females.
- Aiming for higher education (higher) is associated with a lower probability of high use.

Let’s do some explorations.

par(mfrow = c(2,2))
qplot(x = famrel, col = high_use, data = data, geom = 'density')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

round(prop.table(table(data$romantic, data$high_use), margin = 1), 2)
##      
##       FALSE TRUE
##   no   0.69 0.31
##   yes  0.72 0.28
round(prop.table(table(data$sex, data$high_use), margin = 1), 2)
##    
##     FALSE TRUE
##   F  0.79 0.21
##   M  0.60 0.40
round(prop.table(table(data$higher, data$high_use), margin = 1), 2)
##      
##       FALSE TRUE
##   no   0.44 0.56
##   yes  0.71 0.29

At a glance, it seems like these match our expectations. The high use group is at least less likely to report very good family relationships. Those with a romantic relationship are less likely to use a lot of alcohol, although this effect doesn’t seem very big. Males seem clearly more likely to belong in the high use group in comparison to females, as do those who are not interested in higher education, in comparison to those who are.

Let’s continue with a logistic regression model next.

fit <- glm(high_use ~ sex + romantic + higher + famrel, data = data, family = binomial)
summary(fit)
## 
## Call:
## glm(formula = high_use ~ sex + romantic + higher + famrel, family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4479  -0.8779  -0.6846   1.2175   2.0105  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9173     0.7489   1.225 0.220626    
## sexM          0.9073     0.2412   3.762 0.000168 ***
## romanticyes  -0.2167     0.2615  -0.828 0.407447    
## higheryes    -0.9270     0.5417  -1.711 0.087006 .  
## famrel       -0.3305     0.1264  -2.614 0.008950 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 426.00  on 365  degrees of freedom
## AIC: 436
## 
## Number of Fisher Scoring iterations: 4

We see a strongly significant effect for male gender and for family relationships (p < 0.001 and p < 0.01, respectively). Having a romantic relationship seems to have no effect. Interestingly, aiming for higher education has only a weakly significant relationship in this model. Maybe this relates to females being more likely to aim for higher education. Let’s see:

round(prop.table(table(data$sex, data$higher), margin = 1), 2)
##    
##       no  yes
##   F 0.02 0.98
##   M 0.07 0.93

Hmm, could be that this has something to do with it. Now, let’s interpret the coefficients and look at the confidence intervals.

fit %>% tidy(conf.int = TRUE, exp = TRUE)
## # A tibble: 5 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    2.50      0.749     1.22  0.221       0.580    11.1  
## 2 sexM           2.48      0.241     3.76  0.000168    1.55      4.00 
## 3 romanticyes    0.805     0.262    -0.828 0.407       0.478     1.34 
## 4 higheryes      0.396     0.542    -1.71  0.0870      0.132     1.14 
## 5 famrel         0.719     0.126    -2.61  0.00895     0.560     0.920

The odds ratio for male gender is around 2.5. The way of interpreting this that seems most intuitive to me is to note that the exponentiated intercept is also around 2.5. The intercept corresponds to a student who is female, not in a romantic relationship, not aiming for higher education, and whose family relationship score is zero (an extrapolation, since famrel is measured on a 1 to 5 scale). Odds of 2.5 correspond to a \(2.5/(2.5 + 1) = 0.71\) probability of being a heavy alcohol user. Now, if we learn that such a person is male instead of female, their odds of being in the high-use group are 2.5 times higher, i.e. \(2.5 \cdot 2.5 = 6.25\), which corresponds to a \(6.25/(6.25 + 1) = 0.86\) probability of belonging in the heavy-use group.
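The odds-to-probability arithmetic above, as a small illustrative helper (using the rounded coefficients):

```r
# Convert odds to a probability: p = odds / (odds + 1)
odds_to_prob <- function(odds) odds / (odds + 1)
odds_to_prob(2.5)        # intercept: ~0.71
odds_to_prob(2.5 * 2.5)  # male, other covariates at the reference: ~0.86
```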

The rest of the coefficients are all smaller than 1, indicating that they predict a lower probability of being a heavy alcohol user. However, the quality of family relationships is the only one for which we can state, at the \(95\%\) level of confidence, that they are indeed associated with a lower probability.

I don’t quite understand what is meant by the part “using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model”. So I’ll just go on and compare predictions from the model we just fitted against the actual values.

Let’s classify a prediction as low use if predicted probability of belonging in the high-use group is smaller than 0.5, and as high use otherwise.

predicted <- round(predict(fit, type = 'response'))
table(data$high_use, predicted)
##        predicted
##           0   1
##   FALSE 251   8
##   TRUE   97  14

97 observations that actually belonged in the high-use group were inaccurately predicted as belonging in the low-use group, and 8 low-use students were classified as high-users. So the error rate is

(8 + 97) / (251 + 8 + 97 + 14)
## [1] 0.2837838

Note that the proportion of students who belong in the high-use group is also around 0.3:

data %>%
  summarise(sum(high_use) / 370)
##   sum(high_use)/370
## 1               0.3

So if we just classified everyone as low-use, our error rate would be approximately the same.
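For comparison, the error rate of that trivial baseline is just the share of high-use students (a sketch, assuming the `data` object from above):

```r
# Trivial "everyone is low-use" baseline: it errs exactly on the high users
mean(data$high_use)   # ~0.30, close to the model's 0.28
```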


Week 4

date()
## [1] "Mon Dec 12 20:25:03 2022"
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Let’s load the Boston data set. It contains information about areas around the city of Boston, including variables such as the per capita crime rate and the mean number of rooms per dwelling.

data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

We have 14 variables, named as above.

There are a lot of variables here. We can see, for instance, that the crime rate (crim) correlates with just about everything, as does the apartment value (medv). All in all, there are a lot of rather strong relationships in the data.

Here’s a summary of the data; after that, let’s standardize it and summarize again:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Everything now has a mean of 0. We can conveniently see that some variables are quite skewed, such as crim, while some others seem more neatly symmetric, such as rm.
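Under the hood, `scale()` just subtracts each column’s mean and divides by its standard deviation; a quick sketch to confirm this for one variable (before crim is replaced below):

```r
# Standardize crim by hand and compare with the scale() result
z <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z, boston_scaled$crim)   # TRUE
```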

Let’s turn crim into a factor.

# Cut the standardized crime rate into four classes at its quartiles
boston_scaled$crim <- as.numeric(boston_scaled$crim)
crim_bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = crim_bins, include.lowest = TRUE, labels = c('low', 'med_low', 'med_high', 'high'))
# Replace the continuous crim with the new factor
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled$crime <- crime

set.seed(1234)
n <- nrow(boston_scaled)

# 80/20 train/test split
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Let’s fit the LDA then.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

fit <- lda(crime ~ ., data = train)

The function given in the exercise set produces a really ugly plot, so let’s use the ggord package instead to create the biplot:

Next, let’s test our model.

predicted <- predict(fit, newdata = test)

table(correct = correct_classes, predicted = predicted$class)
##           predicted
## correct    low med_low med_high high
##   low       15       8        1    0
##   med_low    5      16        5    0
##   med_high   0      12       14    2
##   high       0       0        1   23

On the diagonal, we have correctly categorized observations. One can see that the model gets few observations totally wrong, and most guesses are wrong by a single category.
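The overall accuracy can be read off the diagonal (assuming the `correct_classes` and `predicted` objects from above):

```r
# Share of test observations assigned to the correct crime-rate class
conf <- table(correct = correct_classes, predicted = predicted$class)
sum(diag(conf)) / sum(conf)   # (15 + 16 + 14 + 23) / 102, i.e. about 0.67
```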

Finally, let’s try k-means.

data(Boston)
boston_standard <- as.data.frame(scale(Boston))

summary(dist(boston_standard))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
summary(dist(boston_standard, method = 'manhattan'))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
set.seed(123)

k_max <- 10

sums_of_squares <- sapply(1:k_max, function(x) kmeans(boston_standard, x)$tot.withinss)

qplot(x = 1:k_max, y = sums_of_squares, geom = 'line')

There’s no obvious best solution here, but since we need to pick one, we’ll go with 2 as that’s where the biggest drop happens.

model <- kmeans(boston_standard, 2)
clus <- model$cluster

For some reason ggpairs takes ages to plot this, so let’s go with pairs().

pairs(boston_standard, col = clus)

This isn’t that easy to interpret. Perhaps the most notable thing is that the black cluster seems to contain the areas with the lowest crime rates, whereas the red cluster includes everything else. The red cluster also seems to contain dwellings with lower prices, higher nitrogen oxide levels (nox), and better access to radial highways (rad).


Week 5

date()
## [1] "Mon Dec 12 20:25:31 2022"

Our data set contains information on countries and regions, on the following dimensions:

- seced_ratio: the ratio of females to males with at least secondary education
- lfpr_ratio: the ratio of female to male labour force participation
- eye: expected years of education
- leb: life expectancy at birth
- gni: gross national income per capita
- mmr: maternal mortality ratio
- abr: adolescent birth rate
- prc_parliament: the percentage of women in parliament

We have also removed all rows with missing values, and observations that correspond to regions instead of countries (these are the last seven rows of the data frame).

Now let’s look at it:

data <- read.csv('data/human2.csv', row.names = 1)
summary(data)
##   seced_ratio       lfpr_ratio          eye             leb       
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni              mmr              abr         prc_parliament 
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

There are some interesting things to note here: for instance, gross national income per capita ranges from a measly 581 to 123,124 (which is probably something like Switzerland). There also seem to be countries where adolescents hardly ever give birth, and countries where the rate is over 200 births per 1,000.

ggpairs(data)

Some of the variables are quite skewed (leb, gni, mmr, abr). There are some clear negative correlations (better seen in the correlation matrix plotted below), such as those between maternal mortality rate and adolescent birth rate on the one hand, and expected years of education, life expectancy, female education, and GNI on the other. Sensibly, these female-related variables are also correlated with each other, as are the quality-of-life indicators.

ggcorrplot(cor(data))

Let’s perform a principal component analysis.

model <- prcomp(data)
summary(model)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

So… the first principal component explains practically all of the variance. This is not really a substantive finding: PCA is sensitive to scale, and since unscaled gni has a far larger variance than any other variable, the first component essentially just picks up GNI.
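We can check this directly from the column variances of the unscaled data (assuming the `data` object from above); gni should dwarf everything else:

```r
# Column variances of the unscaled data, largest first
sort(sapply(data, var), decreasing = TRUE)
```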

biplot(model, choices = 1:2, cex = c(0.3, 0.7))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


But this changes a lot when the data are standardized:

data_scaled <- scale(data)
model_scaled <- prcomp(data_scaled)
summary(model_scaled)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

Now the first principal component explains a bit more than half of the variance.
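
As a sanity check on how these numbers arise: the ‘Proportion of Variance’ row is just each component’s squared standard deviation (eigenvalue) divided by their sum, and for standardized data the total variance equals the number of variables. A tiny toy example (not the human development data):

```r
# For standardized data, each variable contributes one unit of variance,
# so the eigenvalues of the PCA sum to the number of columns.
set.seed(2)
toy <- matrix(rnorm(300), ncol = 3)
pca <- prcomp(toy, scale. = TRUE)
eigvals <- pca$sdev^2
sum(eigvals)            # 3, one unit of variance per standardized variable
eigvals / sum(eigvals)  # the proportions reported by summary(pca)
```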

biplot(model_scaled, choices = 1:2, cex = c(0.3, 0.7), ylabs = c('Secondary education', 'Labour participation', 'Expected education', 'Life expectancy', 'GNI', 'Maternal mortality', 'Adolescent births', 'Female representation'))

This is not too legible, but we can sort of see that the first principal component seems to relate to general development indicators, where GNI and education are negatively correlated with factors related to reproduction: maternal mortality, and how often adolescents give birth. The second principal component appears to relate to the (socio-economic) status of women, i.e. how well-represented women are in politics and how often they participate in the workforce. It’s interesting, actually, that these form two separate dimensions, as the socio-economic status of women seems obviously also related to reproduction. One interpretation would be that fixing issues related to reproductive rights helps, but issues related to women’s status remain a separate dimension of problems that need solving.

Finally, let’s move on to the tea data set.

tea <- read_csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv')
## Rows: 300 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): breakfast, tea.time, evening, lunch, dinner, always, home, work, t...
## dbl  (1): age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tea
## # A tibble: 300 × 36
##    break…¹ tea.t…² evening lunch dinner always home  work  tearoom friends resto
##    <chr>   <chr>   <chr>   <chr> <chr>  <chr>  <chr> <chr> <chr>   <chr>   <chr>
##  1 breakf… Not.te… Not.ev… Not.… Not.d… Not.a… home  Not.… Not.te… Not.fr… Not.…
##  2 breakf… Not.te… Not.ev… Not.… Not.d… Not.a… home  Not.… Not.te… Not.fr… Not.…
##  3 Not.br… tea ti… evening Not.… dinner Not.a… home  work  Not.te… friends resto
##  4 Not.br… Not.te… Not.ev… Not.… dinner Not.a… home  Not.… Not.te… Not.fr… Not.…
##  5 breakf… Not.te… evening Not.… Not.d… always home  Not.… Not.te… Not.fr… Not.…
##  6 Not.br… Not.te… Not.ev… Not.… dinner Not.a… home  Not.… Not.te… Not.fr… Not.…
##  7 breakf… tea ti… Not.ev… Not.… Not.d… Not.a… home  Not.… Not.te… friends Not.…
##  8 Not.br… tea ti… evening Not.… Not.d… Not.a… home  Not.… Not.te… Not.fr… Not.…
##  9 breakf… tea ti… Not.ev… Not.… Not.d… Not.a… home  Not.… Not.te… Not.fr… Not.…
## 10 breakf… Not.te… evening Not.… Not.d… Not.a… home  Not.… tearoom Not.fr… Not.…
## # … with 290 more rows, 25 more variables: pub <chr>, Tea <chr>, How <chr>,
## #   sugar <chr>, how <chr>, where <chr>, price <chr>, age <dbl>, sex <chr>,
## #   SPC <chr>, Sport <chr>, age_Q <chr>, frequency <chr>,
## #   escape.exoticism <chr>, spirituality <chr>, healthy <chr>, diuretic <chr>,
## #   friendliness <chr>, iron.absorption <chr>, feminine <chr>,
## #   sophisticated <chr>, slimming <chr>, exciting <chr>, relaxing <chr>,
## #   effect.on.health <chr>, and abbreviated variable names ¹​breakfast, …
str(tea)
## spc_tbl_ [300 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ breakfast       : chr [1:300] "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
##  $ tea.time        : chr [1:300] "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
##  $ evening         : chr [1:300] "Not.evening" "Not.evening" "evening" "Not.evening" ...
##  $ lunch           : chr [1:300] "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
##  $ dinner          : chr [1:300] "Not.dinner" "Not.dinner" "dinner" "dinner" ...
##  $ always          : chr [1:300] "Not.always" "Not.always" "Not.always" "Not.always" ...
##  $ home            : chr [1:300] "home" "home" "home" "home" ...
##  $ work            : chr [1:300] "Not.work" "Not.work" "work" "Not.work" ...
##  $ tearoom         : chr [1:300] "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
##  $ friends         : chr [1:300] "Not.friends" "Not.friends" "friends" "Not.friends" ...
##  $ resto           : chr [1:300] "Not.resto" "Not.resto" "resto" "Not.resto" ...
##  $ pub             : chr [1:300] "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
##  $ Tea             : chr [1:300] "black" "black" "Earl Grey" "Earl Grey" ...
##  $ How             : chr [1:300] "alone" "milk" "alone" "alone" ...
##  $ sugar           : chr [1:300] "sugar" "No.sugar" "No.sugar" "sugar" ...
##  $ how             : chr [1:300] "tea bag" "tea bag" "tea bag" "tea bag" ...
##  $ where           : chr [1:300] "chain store" "chain store" "chain store" "chain store" ...
##  $ price           : chr [1:300] "p_unknown" "p_variable" "p_variable" "p_variable" ...
##  $ age             : num [1:300] 39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : chr [1:300] "M" "F" "F" "M" ...
##  $ SPC             : chr [1:300] "middle" "middle" "other worker" "student" ...
##  $ Sport           : chr [1:300] "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
##  $ age_Q           : chr [1:300] "35-44" "45-59" "45-59" "15-24" ...
##  $ frequency       : chr [1:300] "1/day" "1/day" "+2/day" "1/day" ...
##  $ escape.exoticism: chr [1:300] "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
##  $ spirituality    : chr [1:300] "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
##  $ healthy         : chr [1:300] "healthy" "healthy" "healthy" "healthy" ...
##  $ diuretic        : chr [1:300] "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
##  $ friendliness    : chr [1:300] "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
##  $ iron.absorption : chr [1:300] "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
##  $ feminine        : chr [1:300] "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
##  $ sophisticated   : chr [1:300] "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
##  $ slimming        : chr [1:300] "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
##  $ exciting        : chr [1:300] "No.exciting" "exciting" "No.exciting" "No.exciting" ...
##  $ relaxing        : chr [1:300] "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
##  $ effect.on.health: chr [1:300] "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   breakfast = col_character(),
##   ..   tea.time = col_character(),
##   ..   evening = col_character(),
##   ..   lunch = col_character(),
##   ..   dinner = col_character(),
##   ..   always = col_character(),
##   ..   home = col_character(),
##   ..   work = col_character(),
##   ..   tearoom = col_character(),
##   ..   friends = col_character(),
##   ..   resto = col_character(),
##   ..   pub = col_character(),
##   ..   Tea = col_character(),
##   ..   How = col_character(),
##   ..   sugar = col_character(),
##   ..   how = col_character(),
##   ..   where = col_character(),
##   ..   price = col_character(),
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   SPC = col_character(),
##   ..   Sport = col_character(),
##   ..   age_Q = col_character(),
##   ..   frequency = col_character(),
##   ..   escape.exoticism = col_character(),
##   ..   spirituality = col_character(),
##   ..   healthy = col_character(),
##   ..   diuretic = col_character(),
##   ..   friendliness = col_character(),
##   ..   iron.absorption = col_character(),
##   ..   feminine = col_character(),
##   ..   sophisticated = col_character(),
##   ..   slimming = col_character(),
##   ..   exciting = col_character(),
##   ..   relaxing = col_character(),
##   ..   effect.on.health = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
View(tea)

We have 300 observations and 36 variables. Now, let’s convert everything to factors:

tea <- tea %>%
  mutate(across(everything(), ~ as.factor(.x)))

The MCA package used in the exercise depends on a newer version of R than I have, and trying to update R would probably destroy everything, so that’s all for today, folks.


Week 6

date()
## [1] "Mon Dec 12 20:25:43 2022"

Part 1

## Rows: 176 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): ID, Group, weight, time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

All right, so we’re about to delve into an analysis of a data set from a study of how different diets fed to rats affect their growth. The data set, which has already been converted into long form, includes

  • an ID for each rat
  • the group into which it was assigned
  • the time when it has been weighed (in days after the experiment started), and
  • its weight at that time.

There are 16 rats, and each rat was weighed 11 times, resulting in a total of 176 observations. The observations are dependent in several ways: some are weighings of the same rat, some were taken at the same time, and some are weighings of rats belonging to the same group. In line with this week’s exercise set, we ignore these dependencies for now.

Let’s plot each individual rat’s growth curve as a line. Just to keep track of what’s actually going on, let’s use color to mark the group into which they belong.

ggplot(ratsl, aes(x = time, y = weight, group = ID, color = Group)) +
  geom_line() +
  scale_y_continuous('Weight (grams)') +
  scale_x_continuous('Time (days)')

I guess ‘day 1’ is actually already one week into the experiment, given that group 1 is already seen to diverge from the others at the very beginning of the period.

The next step in the exercise is grouping the observations and standardizing the value of interest. Let’s do that.

ratsl <- ratsl %>%
  group_by(Group) %>%
  dplyr::mutate(stdweight = (weight - mean(weight)) / sd(weight)) %>%
  ungroup()

This makes it a little easier to see that within each group, rats that were heavier at the start tend to be heavier as the experiment goes on.

ggplot(ratsl, aes(x = time, y = stdweight, group = ID)) +
  geom_line() +
  scale_y_continuous('Weight (grams)') +
  scale_x_continuous('Time (days)') +
  facet_wrap(~ Group)

The next step in the exercise was presenting summaries for the different treatment groups at different times.

ratss <- ratsl %>%
  group_by(Group, time) %>%
  summarise(mean = mean(weight), se = sd(weight) / n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(ratss)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ time  <dbl> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 1.902697, 1.636634, 1.434488, 1.700102, 1.382185, 1.472971, 1.36…
ggplot(ratss, aes(x = time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype = "1"), width = 0.3) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)") +
  scale_x_continuous(name = 'Time (days)')

The three groups appear quite distinct. Note the rather wide error bars for the second group. We’ve already seen in the last two plots that there is one rat in that group that’s considerably heavier than the rest. There’s likewise one in the other two groups each that diverges from the others, although not as drastically; and in the case of the first group, in particular, the larger \(n\) means this affects the standard error less.
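
A small aside: se above was computed as sd(weight) / n(), following the exercise set, but the conventional standard error of the mean divides by the square root of \(n\). A quick sketch of the difference (hypothetical numbers, not the actual rat weights):

```r
# sd/n (used in the summary above) vs. the usual standard error sd/sqrt(n);
# with n = 8 rats they differ by a factor of sqrt(8), roughly 2.8
x <- c(240, 250, 255, 260, 262, 258, 266, 266)  # hypothetical weights
sd(x) / sqrt(length(x))  # conventional SE of the mean
sd(x) / length(x)        # the quantity plotted above; smaller by sqrt(n)
```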

Given that we can, nevertheless, observe differences between the groups, let’s just move on for now. What’s interesting here is whether the groups differ from each other: groups 2 and 3 from group 1, and groups 2 and 3 from each other. In the exercise set, observations from the first point in time are considered the baseline. We do not have a baseline in the same sense here, given that weight at the very beginning has not been recorded, so we keep all observations for now.

With this in mind, let’s then perform a pairwise t-test and correct for multiple comparisons.

rats_summaries <- ratsl %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(weight)) %>%
  ungroup() 
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
pairwise.t.test(rats_summaries$mean, rats_summaries$Group, 
                p.adjust.method = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  rats_summaries$mean and rats_summaries$Group 
## 
##   1       2   
## 2 6.5e-07 -   
## 3 8.7e-08 0.41
## 
## P value adjustment method: bonferroni
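
As a quick reminder of what the Bonferroni adjustment is doing here: each raw p-value is multiplied by the number of comparisons (and capped at 1), which p.adjust() demonstrates directly:

```r
# Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
raw_p <- c(0.010, 0.020, 0.400)
p.adjust(raw_p, method = "bonferroni")  # 0.03 0.06 1.00
```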

We see that groups 2 and 3 clearly differ from group 1, but we can’t distinguish group 2 from group 3. But based on the visual investigation performed before, we perhaps should. This could be due to there being outliers, so let’s adapt the analysis from the exercise set for investigating this.

rats_summaries %>%
  ggplot(aes(x = Group, y = mean)) +
    geom_boxplot() +
    stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
    geom_point() +
    scale_y_continuous(name = "Mean(weight)")

Yeah, a few rats can be seen standing out a little. Let’s try filtering out the potential outlier from each group.

rats_summaries %>%
  dplyr::filter(Group == 1 & mean > 250 | 
                  Group == 2 & mean < 500 |
                  Group == 3 & mean > 500) %>%
  ggplot(aes(x = Group, y = mean)) +
    geom_boxplot() +
    stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
    scale_y_continuous(name = "Mean(weight)")

rats_summaries_no_outliers <- rats_summaries %>%
  dplyr::filter(Group == 1 & mean > 250 | 
                Group == 2 & mean < 500 |
                Group == 3 & mean > 500)
pairwise.t.test(rats_summaries_no_outliers$mean, rats_summaries_no_outliers$Group, 
                p.adjust.method = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  rats_summaries_no_outliers$mean and rats_summaries_no_outliers$Group 
## 
##   1       2      
## 2 2.1e-12 -      
## 3 4.1e-14 1.5e-08
## 
## P value adjustment method: bonferroni

Indeed, with the potential outliers removed, we now also see a statistically significant difference between groups 2 and 3.

Now, let’s nevertheless perform the analysis from the exercise set, where a linear model with treatment group and baseline as predictors is fitted with mean weight as the dependent variable. The offending rats’ IDs are 2, 12 and 13, so we’ll remove those.

rats_baseline <- ratsl %>%
  dplyr::filter(ID != 2 & ID != 12 & ID != 13) %>%
  dplyr::filter(time == 1)

rats_for_lm <- rats_summaries %>%
  dplyr::filter(ID != 2 & ID != 12 & ID != 13) %>%
  mutate(baseline = rats_baseline$weight)  # note: relies on both tables being sorted by ID

fit <- lm(mean ~ baseline + Group, data = rats_for_lm)

summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = rats_for_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0310 -2.6287  0.1002  1.8269  7.1808 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.1903    25.7382   7.817 2.66e-05 ***
## baseline      0.2605     0.1010   2.580   0.0297 *  
## Group2      138.8380    17.0411   8.147 1.91e-05 ***
## Group3      199.6530    27.1916   7.342 4.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.669 on 9 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9985 
## F-statistic:  2692 on 3 and 9 DF,  p-value: 1.324e-13
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## baseline   1 174396  174396 7999.137 1.384e-14 ***
## Group      2   1707     853   39.147 3.628e-05 ***
## Residuals  9    196      22                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis identifies not just diet as a relevant factor for rat weight, but also that all three diets differ from each other. It also finds that how much a rat weighed at the start is a contributing factor. That said, the analysis is pretty heavily geared toward finding some differences, given all the assumption violations and removal of ‘outliers’.

Part 2

Let’s move on to part two, and playing with the other data set.

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step

Let’s load the data. To make some things a little simpler later on, we’ll modify subject IDs so that IDs in treatment group 1 range from 1 to 20, and those in treatment group 2 from 21 to 40. This data comes from a clinical trial, where 40 subjects (all male) were divided into two treatment groups, and a psychiatric scale (BPRS) was administered at the beginning, and weekly afterwards. The main variables of interest are the BPRS score, treatment group, and subject identifier.
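
The ID remapping described above isn’t shown in the rendered output; in base R it could be done along these lines (toy data with two subjects per group, purely for illustration; the real data has 20 per group):

```r
# Toy sketch of the remapping: shift subject IDs in treatment group 2 by 20
toy <- data.frame(treatment = c(1, 1, 2, 2), subject = c(1, 2, 1, 2))
toy$subject <- ifelse(toy$treatment == 2, toy$subject + 20, toy$subject)
toy$subject  # 1 2 21 22
```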

## Rows: 360 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): treatment, subject, week, bprs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 360
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ bprs      <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …

Again, let’s plot the data.

p1 <- ggplot(bprsl, aes(x = week, y = bprs, group = subject, color = treatment)) +
  geom_line() +
  scale_x_continuous('Time (weeks)') +
  scale_y_continuous('BPRS')

p1

That’s pretty messy. We can’t really see any meaningful differences between the groups. As already observed earlier, scores tend to go down for almost everyone in the study, with some exceptions.

Let’s move on to fitting models, as was done in the exercise set. We start with a simple linear model.

bprs_reg <- lm(bprs ~ week + treatment, data = bprsl)

summary(bprs_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = bprsl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We identify time as having an effect, but no difference between the treatments. Next up, a random intercept model with a separate intercept for each subject.

Note that I’m using the lmerTest package, which adds p-values to the model summary.

bprs_ref1 <- lmer(bprs ~ week + treatment + (1 | subject), data = bprsl, REML = FALSE)

summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.3521  45.7607  19.750   <2e-16 ***
## week         -2.2704     0.1503 320.0000 -15.104   <2e-16 ***
## treatment2    0.5722     3.2159  40.0000   0.178     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

The standard deviation of the random intercept is close to 10, indicating that the baseline level varies quite a bit between subjects (as could be deduced from the first plot we saw). We see almost exactly the same estimates for the coefficients of week and of belonging to treatment group 2. The coefficient for week is again clearly significant; that for treatment 2 is not. Compared with the simple OLS model, the standard error for week is smaller, and that for treatment 2 larger.

We can plot this model:

p2 <- bprsl %>%
  mutate(ref1_fitted = fitted(bprs_ref1)) %>%
  ggplot(aes(x = week, y = ref1_fitted, group = subject, color = treatment)) +
    geom_line() +
    scale_x_continuous(name = 'Time (weeks)') +
    scale_y_continuous(name = 'Fitted value')

p2

Next, let’s fit a random intercept and random slope model, adding a random slope for week within each subject. Now we have separate intercepts, but also separate slopes, for every individual subject.
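
In formula terms, with \(i\) indexing subjects and \(j\) measurement occasions, the random intercept and slope model can be written as

\[
\text{bprs}_{ij} = \beta_0 + u_{0i} + (\beta_1 + u_{1i})\,\text{week}_{ij} + \beta_2\,\text{treatment2}_{i} + \varepsilon_{ij},
\]

where the subject-level random effects \((u_{0i}, u_{1i})\) are assumed jointly normal. The random intercept model is the special case with \(u_{1i} = 0\) for all subjects, and the estimated correlation between \(u_{0i}\) and \(u_{1i}\) shows up as the Corr column in the random effects table.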

bprs_ref2 <- lmer(bprs ~ week + treatment + (week | subject), data = bprsl, REML = FALSE)

summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  45.9830     2.6470 49.9491  17.372  < 2e-16 ***
## week         -2.2704     0.2713 40.0001  -8.370 2.51e-10 ***
## treatment2    1.5139     3.1392 39.9998   0.482    0.632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
anova(bprs_ref1, bprs_ref2)
## Data: bprsl
## Models:
## bprs_ref1: bprs ~ week + treatment + (1 | subject)
## bprs_ref2: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## bprs_ref1    5 2582.9 2602.3 -1286.5   2572.9                         
## bprs_ref2    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We still see pretty similar fixed effects. Variance associated with subject is increased, meaning that in this model, the starting points of the slopes are even more spread out. The new model nevertheless fits the data better.
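
As an aside, the anova() comparison above is a likelihood-ratio test: the chi-square statistic can be reproduced by hand from the reported deviances (numbers copied from the two fits above):

```r
# The chi-square statistic in anova(bprs_ref1, bprs_ref2) is the difference
# in deviances, with df equal to the difference in parameter counts (7 - 5)
dev_ref1 <- 2572.9
dev_ref2 <- 2509.2
chisq <- dev_ref1 - dev_ref2               # ~63.7, matching the anova table
pchisq(chisq, df = 2, lower.tail = FALSE)  # ~1.5e-14, matching Pr(>Chisq)
```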

We can similarly plot this model. The lines tend to slope downwards, though some have a positive slope.

p3 <- bprsl %>%
  mutate(ref2_fitted = fitted(bprs_ref2)) %>%
  ggplot(aes(x = week, y = ref2_fitted, group = subject, color = treatment)) +
    geom_line() +
    scale_x_continuous(name = 'Time (weeks)') +
    scale_y_continuous(name = 'Fitted value')

p3

Finally, we can add an interaction between week and treatment group.

# note: week * treatment already expands to week + treatment + week:treatment,
# so writing the main effects explicitly is redundant (but harmless)
bprs_ref3 <- lmer(bprs ~ week + treatment + (week | subject) + week * treatment, data = bprsl, REML = FALSE)

summary(bprs_ref3)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)      47.8856     2.9840 40.0010  16.047  < 2e-16 ***
## week             -2.6283     0.3752 40.0021  -7.006 1.84e-08 ***
## treatment2       -2.2911     4.2200 40.0010  -0.543    0.590    
## week:treatment2   0.7158     0.5306 40.0021   1.349    0.185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
anova(bprs_ref2, bprs_ref3)
## Data: bprsl
## Models:
## bprs_ref2: bprs ~ week + treatment + (week | subject)
## bprs_ref3: bprs ~ week + treatment + (week | subject) + week * treatment
##           npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref2    7 2523.2 2550.4 -1254.6   2509.2                    
## bprs_ref3    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

Note that this model does not appear to fit better than the previous one, overall. The effect of time remains quite similar to the previous models; the effect of belonging to group 2 flips negative, but remains non-significant. The interaction appears irrelevant. We can again plot this to see that the difference between this final model and the previous one is negligible.

p4 <- bprsl %>%
  mutate(ref3_fitted = fitted(bprs_ref3)) %>%
  ggplot(aes(x = week, y = ref3_fitted, group = subject, color = treatment)) +
    geom_line() +
    scale_x_continuous(name = 'Time (weeks)') +
    scale_y_continuous(name = 'Fitted value')

p4

To compare all four models, let’s plot them again, this time splitting each plot into facets for the two groups.

p1 + facet_wrap(~ treatment)

p2 + facet_wrap(~ treatment)

p3 + facet_wrap(~ treatment)

p4 + facet_wrap(~ treatment)

In the exercise set, we removed the dude in group 2 with an excessively high BPRS score. I’ve kept him in the mix for now, but let’s try removing him. We can see that he’s the one with a BPRS over 80 at the beginning, so let’s find him.

bprsl %>%
  dplyr::filter(bprs > 80)
## # A tibble: 1 × 4
##   treatment subject  week  bprs
##   <fct>     <fct>   <dbl> <dbl>
## 1 2         31          1    95
# Gotcha!

bprsl_f <- bprsl %>%
  dplyr::filter(subject != 31)

Let’s just fit the full model, with random intercepts, random slopes, and interactions, to see if it appears any different from the one fitted to the data with the ‘outlier’; if so, we can investigate further.

bprs_ref4 <- lmer(bprs ~ week + treatment + (week | subject) + week * treatment, data = bprsl_f, REML = FALSE)

summary(bprs_ref4)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: bprsl_f
## 
##      AIC      BIC   logLik deviance df.resid 
##   2441.0   2471.9  -1212.5   2425.0      343 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5065 -0.5146 -0.0825  0.4605  3.8352 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 132.141  11.495        
##           week          2.281   1.510   -0.71
##  Residual              35.769   5.981        
## Number of obs: 351, groups:  subject, 39
## 
## Fixed effects:
##                 Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)      47.8856     2.6986 38.9988  17.744  < 2e-16 ***
## week             -2.6283     0.3793 38.9988  -6.929 2.68e-08 ***
## treatment2       -4.2399     3.8663 38.9988  -1.097    0.280    
## week:treatment2   0.7476     0.5434 38.9988   1.376    0.177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.720              
## treatment2  -0.698  0.503       
## wek:trtmnt2  0.503 -0.698 -0.720

Doesn’t seem so: we still don’t see any statistically significant differences between the groups, or statistically significant interaction effects. The plot looks pretty similar to what we had before.

bprsl_f %>%
  mutate(ref4_fitted = fitted(bprs_ref4)) %>%
  ggplot(aes(x = week, y = ref4_fitted, group = subject, color = treatment)) +
    geom_line() +
    scale_x_continuous(name = 'Time (weeks)') +
    scale_y_continuous(name = 'Fitted value') +
    facet_wrap(~ treatment)

All in all, performing the analysis this way doesn’t seem to affect the conclusions.